arxiv:2308.12966

Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities

Published on Aug 24, 2023
Authors: Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, Jingren Zhou
Abstract

The Qwen-VL series of large-scale vision-language models excels in image captioning, question answering, and visual localization, outperforming existing models.

AI-generated summary

We introduce the Qwen-VL series, a set of large-scale vision-language models designed to perceive and understand both text and images. Comprising Qwen-VL and Qwen-VL-Chat, these models exhibit remarkable performance in tasks like image captioning, question answering, visual localization, and flexible interaction. The evaluation covers a wide range of tasks including zero-shot captioning, visual and document-based question answering, and grounding. We demonstrate that Qwen-VL outperforms existing Large Vision-Language Models (LVLMs). We present their architecture, training, capabilities, and performance, highlighting their contributions to advancing multimodal artificial intelligence. Code, demo, and models are available at https://github.com/QwenLM/Qwen-VL.
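For readers who want to try the released checkpoints, the following is a minimal usage sketch (not taken from the paper itself): it assumes the Qwen/Qwen-VL-Chat checkpoint on Hugging Face and the chat helpers (tokenizer.from_list_format, model.chat) provided by the repository's trust_remote_code model code; the image path and prompts are placeholders.

```python
# Minimal sketch of querying Qwen-VL-Chat through Hugging Face transformers.
# Assumes the "Qwen/Qwen-VL-Chat" checkpoint and the chat helpers shipped with its
# trust_remote_code implementation; the image path and prompts are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

torch.manual_seed(1234)  # make sampling reproducible

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-VL-Chat", device_map="auto", trust_remote_code=True
).eval()

# Interleave an image with a text instruction (captioning / VQA style query).
query = tokenizer.from_list_format([
    {"image": "path/to/your_image.jpg"},      # placeholder image
    {"text": "Describe this picture."},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)  # free-form caption or answer

# Visual grounding: Qwen-VL-Chat can answer with <ref>...</ref><box>...</box> tags
# that encode bounding-box coordinates for the referred object.
response, history = model.chat(
    tokenizer,
    query="Output the bounding box of the main subject in the image.",
    history=history,
)
print(response)
```

Because the chat interface keeps a running history, follow-up grounding or reasoning questions about the same image can be asked in later turns.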

